
    EOLE: Toward a Practical Implementation of Value Prediction

    A new architecture, Early/Out-of-Order/Late Execution (EOLE), leverages value prediction to execute a significant number of instructions outside the out-of-order engine. This approach reduces the issue width, which is a major contributor to both out-of-order engine complexity and the register file port requirement. This reduction paves the way for a truly practical implementation of value prediction.

    Rebasing Microarchitectural Research with Industry Traces

    Microarchitecture research relies on performance models with various degrees of accuracy and speed. In the past few years, one such model, ChampSim, has started to gain significant traction by coupling ease of use with a reasonable level of detail and simulation speed. At the same time, datacenter-class workloads, which are not trivial to set up and benchmark, have become easier to study via the release of hundreds of industry traces following the first Championship Value Prediction (CVP-1) in 2018. A tool was quickly created to port the CVP-1 traces to the ChampSim format, and the converted traces have, as a result, been used in many recent works. In this paper, we revisit this conversion tool and find that several key aspects of the CVP-1 traces are not preserved by the conversion. We therefore propose an improved converter that addresses most conversion issues and patches known limitations of the CVP-1 traces themselves. We evaluate the impact of our changes on two commits of ChampSim, one of which was used for the first Instruction Prefetching Championship (IPC-1) in 2020. We find that the performance variation stemming from higher-accuracy conversion is significant.
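    As a point of reference for why such a conversion can be lossy, consider ChampSim's on-disk trace record. The layout below is reproduced from memory of IPC-1-era ChampSim sources, so treat the exact definition as an assumption; the key observation is that it has no field for produced register values, which CVP-1 traces do carry and which value-prediction studies need.

```cpp
#include <cstdint>

// ChampSim trace record (input_instr) as commonly defined around the
// IPC-1-era commits; layout reproduced from memory, treat as an assumption.
constexpr int NUM_INSTR_DESTINATIONS = 2;
constexpr int NUM_INSTR_SOURCES = 4;

struct input_instr {
    uint64_t ip;                                            // instruction pointer
    uint8_t  is_branch;                                     // 1 if branch
    uint8_t  branch_taken;                                  // 1 if taken
    uint8_t  destination_registers[NUM_INSTR_DESTINATIONS]; // output registers
    uint8_t  source_registers[NUM_INSTR_SOURCES];           // input registers
    uint64_t destination_memory[NUM_INSTR_DESTINATIONS];    // store addresses
    uint64_t source_memory[NUM_INSTR_SOURCES];              // load addresses
    // Note: no field for the produced value; CVP-1 records include result
    // values, so a converter to this format must drop or re-derive them.
};
```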

    EOLE: Paving the Way for an Effective Implementation of Value Prediction

    Published at the International Symposium on Computer Architecture (ISCA) 2014. Link: http://people.irisa.fr/Arthur.Perais/data/ISCA%2714_EOLE.pdf
    Even in the multicore era, there is a continuous demand to increase the performance of single-threaded applications. However, the conventional path of increasing both issue width and instruction window size inevitably leads to the power wall. Value prediction (VP) was proposed in the mid-'90s as an alternative path to further enhance the performance of wide-issue superscalar processors. Still, until recently, a performance-effective implementation of VP was considered to add tremendous complexity and power consumption to almost every stage of the pipeline. Recent work in the field has shown, however, that given an efficient confidence estimation mechanism, prediction validation can be removed from the out-of-order engine and delayed until commit time. As a result, recovering from mispredictions via selective replay can be avoided and a much simpler mechanism, pipeline squashing, can be used, while the out-of-order engine remains mostly unmodified. Nonetheless, VP with validation at commit time entails strong constraints on the physical register file: write ports are needed to write predicted results and read ports are needed to validate them at commit time, potentially rendering the overall number of ports impractical. Fortunately, VP also implies that many single-cycle ALU instructions have their operands predicted in the front-end and can be executed in-place and in-order. Similarly, the execution of single-cycle instructions whose result has been predicted can be delayed until just before commit, since predictions are validated at commit time. Consequently, a significant number of instructions (10% to 60% in our experiments) can bypass the out-of-order engine, allowing a reduction of the issue width, which is a major contributor to both out-of-order engine complexity and register file port requirement. This reduction paves the way for a truly practical implementation of value prediction. Furthermore, since value prediction in itself usually increases performance, our resulting {Early | Out-of-Order | Late} Execution (EOLE) architecture is often more efficient than a baseline VP-augmented 6-issue superscalar while having a significantly narrower 4-issue out-of-order engine.
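    To make the {Early | Out-of-Order | Late} split concrete, here is a minimal sketch of what dispatch-time routing could look like. Field and function names are illustrative assumptions, not the paper's exact microarchitecture.

```cpp
#include <cstdio>

// Illustrative instruction descriptor; fields are assumptions for the sketch.
struct Uop {
    bool single_cycle_alu;   // simple ALU operation (add, shift, logical...)
    bool operands_predicted; // all source operands available in the front-end
    bool result_predicted;   // own result predicted with high confidence
};

enum class Route { Early, OutOfOrder, Late };

// EOLE-style routing: execute simple ops with known operands in-order in the
// front-end (Early), defer predicted simple ops until just before commit,
// where validation happens anyway (Late), and send everything else to the
// out-of-order engine.
Route route(const Uop& u) {
    if (u.single_cycle_alu && u.operands_predicted) return Route::Early;
    if (u.single_cycle_alu && u.result_predicted)   return Route::Late;
    return Route::OutOfOrder;
}

int main() {
    Uop add_known{true, true, false};
    Uop add_pred{true, false, true};
    Uop load{false, false, false};
    std::printf("%d %d %d\n", (int)route(add_known), (int)route(add_pred),
                (int)route(load)); // prints 0 (Early), 2 (Late), 1 (OutOfOrder)
}
```

    The point of the split is that Early and Late instructions never allocate an issue-queue slot, which is what permits narrowing the out-of-order issue width.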

    Practical Data Value Speculation for Future High-end Processors

    A fait l'objet d'une publication à "High Performance Computer Architecture (HPCA) 2014" Lien : http://people.irisa.fr/Arthur.Perais/data/HPCA%2714_Practical_VP.pdf; Dedicating more silicon area to single thread performance will necessarily be considered as worthwhile in future - potentially heterogeneous - multicores. In particular, Value prediction (VP) was proposed in the mid 90's to enhance the performance of high-end uniprocessors by breaking true data dependencies. In this paper, we reconsider the concept of Value Prediction in the contemporary context and show its potential as a direction to improve current single thread performance. First, building on top of research carried out during the previous decade on confidence estimation, we show that every value predictor is amenable to very high prediction accuracy using very simple hardware. This clears the path to an implementation of VP without a complex selective reissue mechanism to absorb mispredictions, where prediction is performed in the in-order pipeline frond-end and validation is performed in the in-order pipeline back-end, while the out-of-order engine is only marginally modified. Second, when predicting back-to-back occurrences of the same instruction, previous context-based value predictors relying on local value history exhibit a complex critical loop that should ideally be implemented in a single cycle. To bypass this requirement, we introduce a new value predictor VTAGE harnessing the global branch history. VTAGE can seamlessly predict back-to-back occurrences, allowing predictions to span over several cycles. It achieves higher performance than previously proposed context-based predictors. Specifically, using SPEC'00 and SPEC'06 benchmarks, our simulations show that combining VTAGE and a Stride-based predictor yields up to 65% speedup on a fairly aggressive pipeline without support for selective reissue.; Dédier plus de surface de silicium à la performance séquentielle sera nécessairement considéré comme digne d'interÃÂȘt dans un futur proche. En particulier, la Prédiction de Valeurs (VP) a été proposée dans les années 90 afin d'améliorer la performance séquentielle des processeurs haute-performance en cassant les dépendances de données entre instructions. Dans ce papier, nous revisitons le concept de Prédiction de Valeurs dans un contexte contemporain et montrons son potentiel d'amélioration de la performance séquentielle. Spécifiquement, utilisant les suites de benchmarks SPEC'00 et SPEC'06, nos simulations montrent qu'en combinant notre prédicteur, VTAGE, avec un prédicteur de type Stride, des gains de performances allant jusqu'à 65% peuvent ÃÂȘtre observés sur un pipeline relativement agressif mais sans ré-exécution sélective en cas de mauvaise prédiction. Document type: External research repor

    Leveraging Targeted Value Prediction to Unlock New Hardware Strength Reduction Potential

    Value Prediction (VP) is a microarchitectural technique that speculatively breaks data dependencies to increase the available Instruction Level Parallelism (ILP) in general purpose processors. Despite recent proposals, VP remains expensive and has intricate interactions with several stages of the classical superscalar pipeline. In this paper, we revisit and simplify VP by leveraging the irregular distribution of the values produced during the execution of common programs. First, we demonstrate that a reasonable fraction of the performance uplift brought by a full VP infrastructure can be obtained by predicting only a few "usual suspects" values. Furthermore, we show that doing so greatly simplifies VP operation and reduces the value predictor footprint. Lastly, we show that these Minimal and Targeted VP infrastructures conceptually enable Speculative Strength Reduction (SpSR), a rename-time optimization whereby instructions can disappear at rename in the presence of specific operand values.
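    A minimal sketch of the "usual suspects" idea follows: each static instruction carries confidence counters only for a tiny fixed set of candidate values. The candidate set {0, 1} and the counter widths are illustrative assumptions, not the paper's chosen values.

```cpp
#include <cstdint>
#include <array>
#include <unordered_map>

// Illustrative candidate set; the paper's actual "usual suspects" may differ.
constexpr std::array<uint64_t, 2> kSuspects = {0, 1};

struct Counters { std::array<uint8_t, kSuspects.size()> conf{}; };

struct TargetedVP {
    std::unordered_map<uint64_t, Counters> table; // keyed by instruction PC

    // Predict a suspect value only once its counter is saturated.
    bool predict(uint64_t pc, uint64_t& out) const {
        auto it = table.find(pc);
        if (it == table.end()) return false;
        for (size_t i = 0; i < kSuspects.size(); ++i)
            if (it->second.conf[i] >= 7) { out = kSuspects[i]; return true; }
        return false;
    }

    // Train on the committed result: strengthen the matching suspect,
    // back off the others.
    void train(uint64_t pc, uint64_t actual) {
        Counters& c = table[pc];
        for (size_t i = 0; i < kSuspects.size(); ++i) {
            if (actual == kSuspects[i]) { if (c.conf[i] < 7) ++c.conf[i]; }
            else if (c.conf[i] > 0)     --c.conf[i];
        }
    }
};
```

    The predictor only ever stores small counters, never full 64-bit values, which is where the footprint reduction comes from.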

    Exploiting Value Prediction With Quasi-Unlimited Resources

    Recent trends regarding general purpose microprocessors have focused on Thread-Level Parallelism (TLP) and, in general, on parallel architectures such as multicores. However, due to Amdahl's law, the gain to be had from parallelizing a program is limited, since there will always be an incompressible sequential part whose execution time depends only on the sequential performance of the processor the program is executed on. Value Prediction was proposed in the late '90s as a way to improve sequential performance by predicting instruction results, allowing the hardware to break data dependencies between instructions and thus extract more Instruction Level Parallelism (ILP) from the code. In the meantime, very accurate geometric-length indirect branch target predictors such as ITTAGE were proposed. Indirect branch target prediction and value prediction exhibit some conceptual similarities, which is why we present a value predictor borrowing from both the indirect branch target predictor ITTAGE and existing work in the field of value prediction. As the transistor budget is not expected to be a problem for future microprocessors, we study the behavior of the Value TAGE (VTAGE) predictor at both finite and "infinite" sizes. We evaluate VTAGE performance on standard integer and floating-point workloads as well as on vectorized code.
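    For context, the "geometric length" shared by ITTAGE and VTAGE means the history length used to index component i grows as a geometric series; a common formulation (parameters illustrative, not taken from the paper) is:

```latex
L_i = \left\lceil L_1 \cdot \alpha^{\,i-1} \right\rceil, \qquad i = 1, \dots, N, \quad \alpha > 1
```

    For example, L_1 = 4 and alpha = 2 give lengths 4, 8, 16, 32: short histories capture recent, frequent correlation, while a few components still reach very long histories cheaply.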

    A Case for Speculative Strength Reduction

    Most high performance general purpose processors leverage register renaming to implement optimizations such as move elimination or zero-idiom elimination. Those optimizations can be seen as forms of strength reduction, whereby a faster but semantically equivalent operation is substituted for a slower operation. In this letter, we argue that other reductions can be performed dynamically if the input values of instructions are known in time, i.e., prior to renaming. We study the potential for leveraging Value Prediction to achieve that goal and show that in SPEC2k17, an average of 3.3% (up to 6.8%) of the dynamic instructions could be dynamically strength-reduced. Our experiments suggest that a state-of-the-art value predictor captures 59.7% of that potential on average (up to 99.6%).
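    A sketch of what a rename-time reduction check could look like is shown below. The opcode set and rules are illustrative assumptions; the letter's actual reduction tables may differ. Because operand values come from the value predictor, every reduction remains speculative and must be validated like any other prediction.

```cpp
#include <cstdint>
#include <optional>

enum class Op { Add, Mul, And, Or };                 // illustrative opcode set
enum class Reduced { None, MoveSrcA, MoveSrcB, ZeroIdiom };

// Rename-time speculative strength reduction: if a predicted operand value
// turns an instruction into an idiom the renamer already eliminates (moves,
// zero idioms), the instruction can disappear at rename. std::optional
// models "operand predicted with high confidence" vs. unknown.
Reduced reduce(Op op, std::optional<uint64_t> a, std::optional<uint64_t> b) {
    switch (op) {
        case Op::Add: // x + 0 -> move of the other source
        case Op::Or:  // x | 0 -> move of the other source
            if (a && *a == 0) return Reduced::MoveSrcB;
            if (b && *b == 0) return Reduced::MoveSrcA;
            break;
        case Op::Mul: // x * 0 -> zero idiom; x * 1 -> move of the other source
            if ((a && *a == 0) || (b && *b == 0)) return Reduced::ZeroIdiom;
            if (a && *a == 1) return Reduced::MoveSrcB;
            if (b && *b == 1) return Reduced::MoveSrcA;
            break;
        case Op::And: // x & 0 -> zero idiom
            if ((a && *a == 0) || (b && *b == 0)) return Reduced::ZeroIdiom;
            break;
    }
    return Reduced::None; // no known reduction: rename normally
}
```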

    High Performance General Purpose Architecture and Microarchitecture

    In this talk, we will provide a broad overview of why general purpose processors are here to stay and some research directions pursued by the Computer Architecture community. We will also briefly present recent work done at TIMA to improve the performance of general purpose processors.

    Value Prediction as a Means to Increase the Performance of Superscalar Processors

    Although currently available general purpose microprocessors feature more than 10 cores, many programs remain mostly sequential. This can be due to an inherent property of the algorithm used by the program, to the program being old and written during the uni-processor era, or simply to time-to-market constraints, as writing and validating parallel code is known to be hard. Moreover, even for parallel programs, the performance of the sequential part quickly becomes the limiting factor as more cores are made available to the application, as expressed by Amdahl's law. Consequently, increasing sequential performance remains a valid approach in the multicore era. Unfortunately, the conventional means to do so (increasing the out-of-order window size and issue width) are major contributors to the complexity and power consumption of the chip. In this thesis, we revisit a previously proposed technique that improves performance in an orthogonal fashion: Value Prediction (VP). Instead of increasing the execution engine aggressiveness, VP improves the utilization of existing resources by increasing the available Instruction Level Parallelism. In particular, we address the three main issues preventing VP from being implemented. First, we propose to remove validation and recovery from the execution engine and to perform them in-order at commit. Second, we propose a new execution model that executes some instructions in-order either before or after the out-of-order engine. This reduces pressure on said engine and allows its aggressiveness to be reduced; as a result, the port requirement on the physical register file and the overall complexity decrease. Third, we propose a prediction scheme that mimics the instruction fetch scheme: block-based prediction, sketched below. This allows several instructions to be predicted per cycle with a single read, hence a single port on the predictor array. These three propositions form a possible implementation of Value Prediction that is both realistic and efficient.
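    A minimal sketch of block-based prediction, the thesis's third proposition: the predictor mirrors instruction fetch by storing one entry per fetch block, so a single read returns predictions for every instruction in the block. Block size, slot count, and indexing below are illustrative assumptions.

```cpp
#include <cstdint>
#include <array>
#include <vector>

constexpr int kSlots       = 8;  // max instructions per fetch block (assumed)
constexpr int kBlockLog2   = 5;  // 32-byte aligned fetch blocks (assumed)
constexpr int kEntriesLog2 = 12; // number of predictor entries (assumed)

struct Slot { uint64_t value = 0; uint8_t confidence = 0; };
struct BlockEntry { std::array<Slot, kSlots> slots; };

struct BlockBasedPredictor {
    std::vector<BlockEntry> table{1u << kEntriesLog2};

    // One access per fetch block: a single read port on the predictor array
    // yields up to kSlots predictions, matching the fetch unit's bandwidth.
    const BlockEntry& lookup(uint64_t block_pc) const {
        return table[(block_pc >> kBlockLog2) & ((1u << kEntriesLog2) - 1)];
    }
};
```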